A Methodology Supporting the Design and Evaluating the Final Quality of Data Warehouses

Authors

Maurizio Pighin

Lucio Ieronutti

Abstract

The design and configuration of a data warehouse can be difficult tasks, especially in the case of very large databases and in the presence of redundant information. In particular, the choice of which attributes to treat as dimensions and measures can be nontrivial, and it can heavily influence the effectiveness of the final system. In this article, we propose a methodology targeted at supporting the design and deriving information on the total quality of the final data warehouse. We tested our proposal on three real-world commercial ERP databases.

Introduction and Motivation

Information systems allow companies and organizations to collect a large amount of transactional data. Starting from this data, data warehousing provides architectures and tools to derive information at a level of abstraction suitable for supporting decision processes. Different factors influence the effectiveness of a data warehouse and the quality of related decisions. For example, while the selection of good-quality operational data enables decision makers to better target the decision process in the presence of alternative choices (Chengalur-Smith, Ballou, & Pazer, 1999), poor-quality data cause information scrap and rework that waste people, money, materials, and facilities resources (Ballou, Wang, Pazer, & Tayi, 1998; English, 1999; Wang & Strong, 1996a, 1996b).
We have recently started to address the problem of data quality in data warehouses (Pighin & Ieronutti, 2007). At the beginning of our research, we considered the semantics-based solutions proposed in the literature, and then we moved towards statistical methods. In a real-world scenario, data warehouse engineers typically have only partial knowledge of a specific operational database (e.g., how an organization really uses the operational system) and of its semantics, so they need support in selecting the data required to build a data warehouse. We therefore propose a context-independent methodology that is able both to support the expert during data warehouse creation and to evaluate the final quality of the design choices taken. The proposed solution focuses mainly on statistical and syntactical aspects of data rather than on semantics, and it is based on a set of metrics, each one designed to capture a particular data feature. However, since most design choices are based on semantic considerations, our goal is to propose a solution that can be coupled with semantics-based techniques (for instance, the one proposed by Golfarelli, Maio, and Rizzi (1998)) to effectively drive design choices. In particular, our methodology is effective in the following situations:

• During the construction phase, it is able to drive the selection of an attribute in the case of multiple choices (i.e., redundant information); for example, when an attribute belongs to different tables of a given database or to different databases (the typical scenario in this kind of application). Additionally, it is able to evaluate the quality of each choice (i.e., the informative value added to the final data warehouse by choosing a table and its attribute as a measure or dimension).
• At the end of the data warehouse design, it measures in quantitative terms the final quality of the data warehouse.
Moreover, in the case of data warehouses based on the same design choices (characterized by the same schema), our methodology is also able to evaluate how the data actually stored in the initial database influence the informative content of the resulting data warehouse. To evaluate the effectiveness of our methodology in identifying the attributes that are most suitable to be used as dimensions and measures, we tested the proposed metrics on three real commercial ERP (Enterprise Resource Planning) systems. Two systems are based on an Informix database running on a Unix server, and one is based on an Oracle database running on a Windows server. In the experiment, they are called DB01, DB02, and DB03, respectively. More specifically, our metrics have been tested on data collected by the selling subsystems. In this article, the terms measures and dimensions refer to the data warehouse, while metrics refers to the indexes defined in the methodology we propose for evaluating data quality and reliability. Moreover, we use DW and DB to denote a decisional data warehouse and an operational database, respectively.

Similar resources

Data mining for decision making in engineering optimal design

Often in modeling the engineering optimization design problems, the value of objective function(s) is not clearly defined in terms of design variables. Instead it is obtained by some numerical analysis such as FE structural analysis, fluid mechanic analysis, and thermodynamic analysis, etc. Yet, the numerical analyses are considerably time consuming to obtain the final value of objective functi...

A Proposed Data Mining Methodology and its Application to Industrial Procedures

Data mining is the process of discovering correlations, patterns, trends or relationships by searching through a large amount of data stored in repositories, corporate databases, and data warehouses. Industrial procedures with the help of engineers, managers, and other specialists, comprise a broad field and have many tools and techniques in their problem-solving arsenal. The purpose of this st...

A Three-Echelon Multi-Objective Multi-Period Multi-Product Supply Chain Network Design Problem: A Goal Programming Approach

In this paper, a multi-objective multi-period multi-product supply chain network design problem is introduced. This problem is modeled using a multi-objective mixed integer mathematical programming. The objectives are maximizing the total profit of logistics, maximizing service level, and minimizing inconsistency of operations. Several sets of constraints are considered to handle the real situa...

A Methodology for Product Performance Analysis under Effects of Multi-Physical Phenomena

Due to the development of science and technology, the computer has become a useful tool for supporting engineering activities in product design. Many computer aided tools such as CAD/CAM, product data management (PDM), product life cycle assessment (PLA), etc., have been popularly used in industry for reducing product development lead-time and increasing total product quality. However, the nume...

Simulation-Based Optimization for Improving Hospital Performance

Background and Objectives: Nowadays health services affect a significant part of social, economic and political parts of each country. In this case, hospitals are considered as the important and final stage of health service supply chain. Consequently, quality of health services offered by hospitals has a straight impact on the safety of individuals. Methods: ...

Journal:

IJDWM

Volume 4, Issue

Pages -

Publication date: 2008
